# loading libraries
library(tidyverse)
library(lubridate)
library(nycflights13)
library(plotly)
# modifying chart size
options(repr.plot.width=5, repr.plot.height=3)
Visualization is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you'll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with. You'll learn how to do all that in this notebook, which will teach you how to transform your data using the dplyr package and a new dataset on flights departing New York City in 2013 (from nycflights13 package).

The pipe, %>%, comes from the magrittr package. Packages in the tidyverse load %>% for you automatically, so you don't usually load magrittr explicitly.
Pipe was designed to decrease development time and to improve readability of the code. magrittr provides a new “pipe”-like operator, %>%, with which you may pipe a value forward into an expression or function call; something along the lines of x %>% f, rather than f(x).
Example:
mpg %>% nrow
Equivalent to nrow(mpg). You might ask why bother to learn a new syntax when the conventional f(x) works just fine; the answer is that pipes become very handy as the number of operations grow. For instance we will see later in this notebook that we can pipe multiple steps together, rather than saving them in temporary variables and reusing them. E.g., x %>% f %>% g %>% h rather than h(g(f(x))).
When using a pipe we can pass the primary object to the function by a pipe and define the secondary function parameters inside as we would normally do. The following code is equivalent to matrix(rnorm(10), ncol = 2):
10 %>%
rnorm %>%
matrix(ncol = 2)
Let's make more random numbers and pipe a plot function at the end:
100 %>%
rnorm %>%
matrix(ncol = 2) %>%
plot
To explore the basic data manipulation verbs of dplyr, we'll use nycflights13::flights. This data frame contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights.
print(flights)
# flights # to see all the columns and more rows
dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
filter() picks cases based on their values.arrange() changes the ordering of the rows.select() picks variables based on their names.mutate() adds new variables that are functions of existing variablessummarize() reduces multiple values down to a single summary.These six functions provide the verbs for a language of data manipulation. These all combine naturally with group_by() which allows you to perform any operation "by group".
All verbs work similarly:
filter()¶filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:
filter(flights, month == 1, day == 1) %>%
print
You can see that I have used %>% print at the end. This is only because I am in Jupyter, in RStudio you can skip that part and still get a similar tibble output.
dplyr functions never modify their inputs, so if you want to save the result, you'll need to use the assignment operator, <-:
jan1 <- filter(flights, month == 1, day == 1)
R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal) for performing comparisons. All of these can be used inside filter(). For instance the following code will filter out the flights without any departure delay:
flights %>%
filter(dep_delay <= 0) %>%
print
We can provide multiple comparisons to filter(). For this you'll need to use Boolean operators: & is "and", | is "or", and ! is "not". Figure below shows a complete set of Boolean operations.

The following code finds all flights that departed in November or December:
flights %>%
filter(month == 11 | month == 12) %>%
print
Alternatively we could use the following syntax:
filter(flights, month %in% c(11, 12))
Whenever you start using complicated, multi-part expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work. You'll learn how to create new variables shortly.
One important feature of R that can make comparison tricky is missing value, or NA ("Not Available"). NA represents an unknown value so missing values are "contagious": almost any operation involving an unknown value will also be unknown:
NA > 5
10 == NA
NA + 10
NA / 2
NA == NA
It's easiest to understand why this is true with a bit more context:
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
We don't know!
Use is.na() to determine if a value is missing
is.na(x)
filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:
(df <- tibble(x = c(1, NA, 3)))
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
Find all flights that
airline dataset to get the two-character carrier codes)# Your answer goes here
Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenge, part 4?
# Your answer goes here
How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
# Your answer goes here
arrange()¶arrange() works similarly to filter() except that instead of selecting rows, it does what it claims it does: changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day) %>%
print
Use desc() to re-order by a column in descending order:
arrange(flights, desc(dep_delay)) %>%
print
Missing values are always sorted at the end:
df <- tibble(x = c(5, 6, 2, NA))
arrange(df, x)
arrange(df, desc(x))
1) How could you use arrange() to sort all missing values to the start? (Hint: use is.na() and apply your answer to the df dataframe we created above).
2) Apply this answer to flights dataset to list the canceled flights first (assuming NA for dep_delay means the flight was canceled) and then show other flights starting with the ones with large dep_delay in a descending order.
# Your answer goes here
select()¶It's not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you're actually interested in. select() allows you to do that:
# Select columns by name
select(flights, year, month, day) %>%
print
# Select all colon between year and day (inclusive)
select(flights, year:day) %>%
print
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day)) %>%
print
There are a number of helper functions you can use within select():
starts_with("abc"): matches names that begin with "abc".ends_with("xyz"): matches names that end with "xyz".contains("ijk"): matches names that contain "ijk".num_range("x", 1:3): matches x1, x2 and x3.everything(): useful for bringing a few variables to the beginning of the dataframe and still want to keep everything else at the end.See ?select for more details.
select() can be used to rename variables (e.g., select(flights, y = year)), but it's rarely useful because it drops all of the variables not explicitly mentioned. Instead, use rename(), which is a variant of select() that keeps all the variables that aren't explicitly mentioned:
rename(flights, y = year, tail_num = tailnum) %>%
print
# everything() example
select(flights, time_hour, aritime = air_time, everything()) %>%
print
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
# Your answer goes here
# Your answer goes here
Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME")) %>%
print
mutate()¶Besides selecting sets of existing columns, it's often useful to add new columns that are functions of existing columns. That's the job of mutate().
mutate() always adds new columns at the end of your dataset, but doesn't change the original dataframe. If you want to keep the outcome of mutate(), or any of the other functions we discussed above, assign the output to a new or existing object.
# Starting with a smaller number of columns
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
print(flights_sml)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
) %>%
print
Alternatively we could've used our piping skills:
flights %>%
select(year:day,
ends_with("delay"),
distance,
air_time
) %>%
mutate(gain = dep_delay - arr_delay,
speed = distance / air_time * 60) %>%
print
Note that you can refer to columns that you've just created:
mutate(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
) %>%
print
Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they're not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
# Your answer goes here
summarize()¶The last key verb is summarize() (or summarise()). It collapses a data frame to a single row:
# Average delay
summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
group_by¶summarize() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they'll be automatically applied "by group". For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:
# Average delay for each day
flights %>%
group_by(year, month, day) %>%
summarize(delay = mean(dep_delay, na.rm = TRUE)) %>%
print
More examples
How does the average distance to destination impact the arrival delay on average?
delay <- flights %>%
group_by(dest) %>% # group by destination
summarize(count = n(), # counting the number of flights (per destination)
Distance = mean(distance, na.rm = TRUE), # average distance (per destination)
Delay = mean(arr_delay, na.rm = TRUE) # average delay (per destination)
) %>%
filter(count > 20, dest != "HNL") # keeping destinations with more than 20 flights that are not "HNL"
# It looks like delays increase with distance up to ~750 miles
# and then decrease. Maybe as flights get longer there's more
# ability to make up delays in the air?
p <- ggplot(data = delay, mapping = aes(x = Distance, y = Delay)) +
geom_point(aes(size = count, color = dest), alpha = 1/3) +
geom_smooth(se = FALSE) +
labs(title = "How does the average distance to destination impact the arrival delay on average?",
caption = "The circle size shows the number of flights to that destination.")
p + theme(legend.position = "none")
ggplotly(p) # Same plot with ggplotly()
Number of flights during the summer
flight_count <- flights %>%
mutate(date = make_datetime(year, month, day)) %>%
group_by(date) %>%
summarize(count = n(), # counting the number of flights (per date)
delay = mean(arr_delay, na.rm = TRUE) # average delay (per date)
)
p <- ggplot(flight_count, aes(date, count)) +
geom_line() +
labs(title = "Number of flights during the summer") +
theme_classic()
p
ggplotly(p) # Same plot with ggplotly()
Interestingly, all of these dips are weekends. Let's check this by creating a bar chart:
flights %>%
mutate(date = make_datetime(year, month, day),
weekday = wday(date, label = TRUE)) %>%
group_by(weekday) %>%
summarize(count = n(), # counting the number of flights (per weekday)
delay = mean(arr_delay, na.rm = TRUE) # average delay (per weekday)
) %>%
ggplot(aes(weekday, count)) +
geom_bar(stat="identity", width = 0.5) +
labs(title="Number of flights per dat of week",
x = "Day of Week",
y = "Count") +
theme_classic() +
theme(axis.text.x = element_text(angle=65, vjust=0.6))
Look at the number of canceled flights per day. Is there a pattern? Is the proportion of canceled flights related to the average delay? (Hint: assume if dep_delay is NA then the flight is canceled)
# Your answer goes here
# Your answer goes here